Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Validation and Proposal message relaying #3412

Closed
wants to merge 7 commits into from

Conversation

gregtatcam
Copy link
Collaborator

  • The motivation for the reduce relay is to reduce CPU and bandwidth cost and improve link latency.
  • Every node independently goes through the peer selection rounds at some random intervals.
  • MAX_PEERS are randomly selected from the pool of directly connected peers with the Validation and Proposal message count from a validator in {MESSAGE_LOW_THRESHOLD,MESSAGE_UPPER_THRESHOLD}. The peers are chosen to be the source of Validation and Proposal messages from this validator. Protocol message is sent to the rest of directly connected peers instructing them to squelch (stop relaying) the message for some random period of time in {MAX_UNSQUELCH_EXPIRE,MIN_UNSQUELCH_EXPIRE}.
  • When the squelch expires or when a new peer connects to the node, another peer selection round starts.
  • When a peer, which is selected to be the source of the messages for the given validator disconnects, or is idled then unsquelch message is sent to all squelched peers and another peer selection round starts.

@gregtatcam gregtatcam requested a review from seelabs May 21, 2020 21:10
@nbougalis nbougalis self-requested a review May 22, 2020 03:43
Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick first pass and had some minor comments. I will get better understanding of the overlay relay code, then finish the review later.

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor feedback on the structure of the PR:

It's easier for reviewers if the commits are organized into logical units. This PR has 70 commits, and those commits reflect the way the feature was developed. As a reviewer, I'm less interested in the timeline of how the feature was developed; I'm more interested in "here's a patch with all the XXX changes".

The biggest issue in this commit is making the Slots and Squlech classes generic (especially Slots). For non-test code they do not need to be generic and do not need callback functions; It would be a HUGE win for code understandably if we could somehow test this without making those classes generic.

* doesn't count messages in this state but a received message
* may transition Slot to Counting state.
*/
template <typename Peer, typename clock_type>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the motivation to make this class generic - it help with testing. However, making this generic makes the code much harder to navigate and understand. We really need to think hard if there's another way to test the code that doesn't necessitate making this class generic.

This is not just a nit to me. Making this class use the concrete peer type, getting rid of all the generic callback functions, and getting rid of the comparator parameter would be a huge improvement in code understandably.

I do know we did something similar in the consensus code. Please think hard if there's another, better solution to the testing problem;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can eliminate the template all together. The idea is to introduce SquelchHandler abstract class, which declares the squelch/unsquelch handlers. Then implementation in Overlay and UnitTest encapsulate specific object Peer type and implement the handlers. We can provide on a fly handler for UnitTest where we want to have some flexibility. The only disadvantage that I see so far of this approach is that instead of storing Peer in the Slot we are storing SquelchHandler so we have to instantiate this object for every Peer and validator. So the additional memory footprint is nValidators * nPeers. The concept is explained in this code example: https://gist.github.com/gregtatcam/4c2a680ab3e3cd5095f130bfb329e19f

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, thanks for diving into this.

There are a couple of ways we could try to solve this:

  1. Templates and callbacks (the current approach). My biggest objection is ability to navigate and understand the code.

  2. New base class for Overlay. I haven't coded this up, but I suspect there isn't enough shared implementation between the test and non-test code. So we'll end up testing something very different from the production code.

  3. variant based wrapper. I like the concreteness of this, but the test code requires several wrapped types. This solution works best if there was only one (or very few) wrapped types.

  4. type erased wrapper (the approach in your patch). This requires an extra allocation, but it is an improvement over the template based approach from a code navigability POV. We may be able to remove the allocation with small object optimization techniques.

  5. Write a test framework that can handle the Overlays and Peers. This by far gives the cleanest, most understandable code. But we (1) don't have that framework and (2) unit tests are still valuable.

Given the above, I vote that we go forward with your proposed patch. However, you might want to wait to hear from other reviewers to see if they have other ideas on this issue as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can implement Peer class in unit tests with most of the methods simply returning some value. This is not going to be the test framework or a step towards implementing the test framework but a way to avoid the extra allocation in (4). We can then pass std::weak_ptr<Peer> to all applicable slot methods as we do today with Peer being a concrete type so no template is required. We will have abstract class SquelchHandler with squelch and unsquelch methods. The class is going to be implemented in Overlay and UnitTest. Reference to the base class is going to be passed to Slots and Slot constructors and squelch and unsquelch called as needed. UnitTest can overwrite the handling on a fly providing flexibility of different handlers as needed. Does this work?

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job getting rid of the templates and callbacks.

I left another set of minor comments, but I suspect that's the last batch from me.

* unsquelched, disconnected peer, or idling peer may transition Slot to
* Counting state.
*/
template <typename clock_type>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this class is simplified to take a single template parameter, consider moving the implementation to a .cpp file and explicitly instantiating the clocks we need for linking.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we have to move ManualClock from the unit test to the Slot's implementation in this case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd have to move the ManualClock to a common header file, and we normally would not include a test file like that in the main code (tho I think we have a good reason in this case). But let's not worry about that now, we can do that later in another PR if we want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not have a test file in the main code. I think a better way is to get rid of the clock template parameter altogether, in another PR.

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Let a couple minor comments that do need to be addressed (a typo and the order of includes), but those changes are so trivial I'm OK approving it now.

Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the code with several test cases. They worked fine.

@manojsdoshi manojsdoshi added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Jun 25, 2020
@manojsdoshi manojsdoshi mentioned this pull request Jun 25, 2020
@manojsdoshi manojsdoshi reopened this Jun 30, 2020
@carlhua carlhua requested review from pwang200 and seelabs July 8, 2020 15:36
@seelabs seelabs removed the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Jul 22, 2020
Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the latest patch, left some comments.

Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me confirm my understanding.

  1. we count and only count for relayed messages, no matter if the messages are trusted or not.
  2. for the same message, all the peers forwarded the message to the local node are recorded by the hashRouter.
  3. there are two code paths that peers are counted by the slots logic:
    -- if they are recorded by the hashRouter before the relayed timestamp is set, they are counted right after the local node try to relay the message. This also works for the case that the hashRouter has recorded all the peers (i.e. the message won't actually be sent.)
    -- (note: thread safety (for the atomic operation between the relayed timestamp and the set of peers) is protected by the mutex in hashRouter)
    -- else (after the relayed timestamp is set), the peers are counted when their messages are received.
    -- (note: for both of the cases, peers are counted only if the message is relayed)
  4. the container (peersWithMessage_) in slots is only for preventing over counting.
  5. if a peer forwards the message too late (8 seconds after the relayed timestamp was set), then the peer is not counted, which is OK.

* the message has been relayed.
* @param key Unique message's key
* @param validator Validator's public key
* @param id Peers id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be peers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* have already seen the messge; i.e.
* the message has been received from
* these peers and added to the hash
* router. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "Return the set of peers from which the local node has received the message." accurate?

It is existing behavior, but could you please put a note saying this function will relay valid proposal messages?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


template <typename clock_type>
void
Slots<clock_type>::unsquelch(id_t const id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of this function? I see this function is called by OverlayImpl::deletePeer(Peer::id_t const id). I feel it might be easier to understand if deletePeer(id, true); was called directly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator Author

@gregtatcam gregtatcam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me confirm my understanding.

  1. we count and only count for relayed messages, no matter if the messages are trusted or not.
  2. for the same message, all the peers forwarded the message to the local node are recorded by the hashRouter.
  3. there are two code paths that peers are counted by the slots logic:
    -- if they are recorded by the hashRouter before the relayed timestamp is set, they are counted right after the local node try to relay the message. This also works for the case that the hashRouter has recorded all the peers (i.e. the message won't actually be sent.)
    -- (note: thread safety (for the atomic operation between the relayed timestamp and the set of peers) is protected by the mutex in hashRouter)
    -- else (after the relayed timestamp is set), the peers are counted when their messages are received.
    -- (note: for both of the cases, peers are counted only if the message is relayed)
  4. the container (peersWithMessage_) in slots is only for preventing over counting.
  5. if a peer forwards the message too late (8 seconds after the relayed timestamp was set), then the peer is not counted, which is OK.

Yes, this is correct understanding on all points.

* unsquelched, disconnected peer, or idling peer may transition Slot to
* Counting state.
*/
template <typename clock_type>
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not have a test file in the main code. I think a better way is to get rid of the clock template parameter altogether, in another PR.

@gregtatcam gregtatcam requested review from pwang200 and seelabs August 3, 2020 19:28
Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good! (left one minor comment about a journal param, but marking this approved now).

Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Contributor

@pwang200 pwang200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still good.

Copy link
Collaborator

@seelabs seelabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@nbougalis nbougalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything obviously wrong with this PR, but I think the code is a bit hairy. This isn't a criticism of your changes @gregtatcam; the code was already hairy when you started working on this. Separately from this PR, we should try to revisit how we do some things, especially the HashRouter.

Message(
::google::protobuf::Message const& message,
int type,
boost::optional<PublicKey> validator = {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document this parameter in the comment above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is documented. This is what it currently states:
@param validator Public Key of the source validator for Validation or Proposal message. Used to check if the message should be squelched.

{
if (!m->has_validatorpubkey())
{
JLOG(p_journal_.debug()) << "onMessage: TMSquelch, missing public key";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be logging such malformed messages; rather we should just charge the peer for sending us bad data and we should drop the message on the floor.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -179,6 +179,12 @@ class Config : public BasicConfig
// Thread pool configuration
std::size_t WORKERS = 0;

// Reduce-relay
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment that these parameters are experimental.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

/** Relay a validation. */
virtual void
relay(protocol::TMValidation& m, uint256 const& uid) = 0;
/** Relay a proposal. Return set
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer doxygen format:

/** Relay a proposal.

    @param m the serialized proposal
    @param uid the id used to identify this proposal
    @param validator The pubkey of the validator that issued this proposal
    @return the set of peers which have already sent us this proposal
*/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -99,6 +99,7 @@ OverlayImpl::Timer::on_timer(error_code ec)
overlay_.m_peerFinder->once_per_second();
overlay_.sendEndpoints();
overlay_.autoConnect();
overlay_.deleteIdlePeers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really need to run every second?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to once every 4 seconds. A peer is considered idle if no message is received in 8 seconds.

@@ -49,12 +49,19 @@ HashRouter::addSuppression(uint256 const& key)

bool
HashRouter::addSuppressionPeer(uint256 const& key, PeerShortID peer)
{
auto [added, _] = addSuppressionPeerWithStatus(key, peer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd simply: return addSuppressionPeerWithStatus(key, peer).first;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

std::pair<bool, std::optional<Stopwatch::time_point>>
HashRouter::addSuppressionPeerWithStatus(const uint256& key, PeerShortID peer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HashRouter API was already confusing enough. It's now become a little too confusing. I don't have an improvement to offer, but we need to think about this.

@carlhua carlhua added the Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required. label Aug 24, 2020
A server may receive multiple copies of a message from
its directly connected peers. We select three peers as
the source of validation and proposal messages from
the validator and "squelch" other peers.
This change reduces CPU and bandwidth costs and improves
link latency.
Charge for bad TMSquelch
Update comments
@gregtatcam gregtatcam requested a review from nbougalis August 26, 2020 17:32
@nbougalis nbougalis mentioned this pull request Sep 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ready to merge *PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants